Suicide, the act of intentionally causing one’s own death, is one of the leading causes of death. As society changes at an increasingly rapid pace, more suicides occur, driven by rising stress and by mental disorders such as depression and anxiety disorders.
To build a system that analyzes suicide rates across countries, age groups, and sexes, classifies the data, and predicts future rates, statistical learning techniques were applied to a dataset of suicide data from 2000 to 2016. The results show that such a system has potential for analyzing the relationship between age, sex, country GDP per capita, and the suicide rate, and then for predicting the rate in each country, especially since the data derive from the official dataset provided by WHO. The results indicate that this prediction can be made with a reasonably small amount of error. However, practical and statistical limitations suggest the need for further investigation.
This Kaggle[^1] dataset was compiled from four other datasets: United Nations Development Program[^2], World Bank[^3], Suicide in the Twenty-First Century[^4], and World Health Organization[^5]. It summarizes a set of potential factors that influence suicide rates across the world.
There are a total of 12 variables with 27,820 observations.
Attribute Information is listed below (The target variable is highlighted in red):
*See Appendix for variable description
The variable suicide/100k pop is calculated from suicides_no and population, so both of those variables are removed because of collinearity. Country-year duplicates the variables country and year and is also removed. Gdp_for_year ($) is not considered because it is directly linked to gdp_per_capita ($). Generation is an approximation of age, so it is not considered either. The final dataset therefore includes 7 variables.
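The variable selection above can be sketched in a few lines. This is a minimal illustration, not the report's actual preprocessing code: the column names follow the appendix, while the example row, its values, and the helper names `rate_per_100k` and `select_features` are hypothetical.

```python
# Columns removed before modelling, as described in the text.
DROPPED = {"suicides_no", "population", "country-year",
           "gdp_for_year ($)", "generation"}

def rate_per_100k(suicides_no, population):
    """Suicides per 100,000 population; shows why the two inputs are redundant."""
    return suicides_no / population * 100_000

def select_features(row):
    """Return a copy of the row without the dropped columns."""
    return {name: value for name, value in row.items() if name not in DROPPED}

# Hypothetical observation; every value here is made up for illustration.
row = {
    "country": "Exampleland",
    "year": 2010,
    "sex": "male",
    "age": "75+ years",
    "suicides_no": 12,
    "population": 60_000,
    "suicide/100k pop": rate_per_100k(12, 60_000),
    "country-year": "Exampleland2010",
    "HDI for year": 0.75,
    "gdp_for_year ($)": 1_200_000_000,
    "gdp_per_capita ($)": 4_000,
    "generation": "Silent",
}

features = select_features(row)
print(len(row), "->", len(features))  # prints: 12 -> 7
```

Because the rate is a deterministic function of `suicides_no` and `population`, keeping those two columns as predictors would leak the target into the model.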
To classify the suicide rate, it was divided into three levels: “High”, “Medium”, and “Low”. Observations with a suicide rate above the 75% quantile of all observations are classified as “High”, observations between the 25% and 75% quantiles are classified as “Medium”, and observations below the 25% quantile are classified as “Low”.
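We also checked whether the classes are imbalanced. As the table below shows, “High” and “Low” each contain about a quarter of the observations and “Medium” about half, which is what the quantile cutoffs imply, so we conclude that class imbalance is not a problem for this dataset.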
| Fatal_rate | Total Number |
|---|---|
| High | 3240 |
| Low | 3253 |
| Medium | 6441 |
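The quantile-based labelling and the class-balance check described above can be sketched as follows. This is an illustrative sketch only: the function name `label_rates`, the example rates, the choice of the "inclusive" quantile method, and the handling of values that fall exactly on a cutoff are assumptions, since the report does not state them. The counts in `table_counts` are copied from the table above.

```python
from collections import Counter
from statistics import quantiles

def label_rates(rates):
    """Label each rate Low/Medium/High using the 25% and 75% quantiles."""
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    labels = []
    for rate in rates:
        if rate > q3:
            labels.append("High")
        elif rate < q1:
            labels.append("Low")
        else:
            # Assumption: values exactly on a cutoff count as Medium.
            labels.append("Medium")
    return labels

# Made-up suicide rates per 100k, for illustration only.
rates = [0.0, 0.5, 1.2, 2.4, 3.1, 4.8, 6.0, 7.5, 9.9, 12.3, 18.7, 25.0]
labels = label_rates(rates)
print(Counter(labels))

# Class shares implied by the counts reported in the table above.
table_counts = {"High": 3240, "Low": 3253, "Medium": 6441}
total = sum(table_counts.values())
shares = {level: count / total for level, count in table_counts.items()}
print(shares)
```

With quartile cutoffs, a roughly 25/50/25 split is expected by construction, so the larger "Medium" class reflects the labelling rule rather than a sampling problem.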
In preparation for model training, a training dataset was created from 80% of the provided data; the remaining 20% was held out as the test dataset.
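A minimal sketch of such an 80/20 split is shown below. It assumes a simple random shuffle with a fixed seed; the report does not say whether the actual split was stratified, and the function name `train_test_split` and the seed value are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into training and test parts."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the observations: 100 row indices.
train, test = train_test_split(range(100))
print(len(train), len(test))  # prints: 80 20
```

Keeping the test part untouched until the final evaluation is what makes the test RMSE and test confusion matrix an honest estimate of out-of-sample performance.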
Two regression models and four classification models were trained, using 5-fold cross-validation unless a different resampling method is noted below. The best model was then chosen separately for regression and for classification.
Regression: a linear regression model using all predictors was fit to the training dataset.
Regression: a k-nearest neighbors (KNN) model using all predictors was fit to the training dataset.
Classification: a random forest model was trained with the out-of-bag (OOB) resampling method.
Classification: a boosted model was trained with cross-validation resampling.
Classification: a multinomial model was trained with cross-validation resampling.
Classification: a neural network model was trained with cross-validation resampling.
Model selection and evaluation are discussed in the results section.
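The 5-fold cross-validation used for tuning can be sketched as below. This is a generic illustration rather than the report's training code: the helper names, the seed, the made-up rates, and the predict-the-training-mean baseline standing in for a real model are all assumptions.

```python
import math
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for held_out in range(k):
        validation = folds[held_out]
        training = [j for f in range(k) if f != held_out for j in folds[f]]
        yield training, validation

def cross_validated_rmse(y, k=5):
    """Average validation RMSE of a baseline that predicts the training mean."""
    fold_scores = []
    for training, validation in k_fold_indices(len(y), k):
        prediction = sum(y[j] for j in training) / len(training)
        mse = sum((y[j] - prediction) ** 2 for j in validation) / len(validation)
        fold_scores.append(math.sqrt(mse))
    return sum(fold_scores) / k

rates = [2.1, 3.4, 0.5, 12.9, 7.7, 4.0, 9.3, 1.2, 15.6, 6.8]  # made-up values
print(cross_validated_rmse(rates))
```

Each observation is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out estimate would.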
For the regression models, the table below shows the RMSE of the predicted suicide rates from the two models on the training dataset. Linear regression has the lower RMSE and was therefore chosen to fit the test dataset, where it obtains an RMSE of 12.488.
| Model | Training RMSE | R-squared | MAE |
|---|---|---|---|
| Linear Regression | 12.476 | 0.526 | 8.242 |
| KNN | 16.624 | 0.181 | 10.069 |
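For reference, the three metrics in the table can be computed as sketched below. The example values are made up, and R-squared is computed here as the squared Pearson correlation between observed and predicted values, which is one common definition; the report does not state which definition its software used.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Made-up observed and predicted rates, for illustration only.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 6.0, 6.0, 10.0]
print(rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently when a few observations have very high rates.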
For classification, the table below shows the highest accuracy achieved by each model on the training dataset, followed by confusion matrices. Intermediate tuning results can be found in the appendix. Due to computational limitations, only three confusion matrices are presented. The random forest model was chosen because it has the highest accuracy of the four models.
Models were tuned for accuracy, but sensitivity was also considered when choosing a final model. Aside from the random forest model, all models had similar performance.
| Model | Accuracy |
|---|---|
| Random Forest | 0.876 |
| Boosted Model | 0.828 |
| Multinomial | 0.837 |
| Neural Network | 0.769 |
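Confusion matrix for the boosted model (entries are percentages of all observations; the diagonal sums to the accuracy of 0.828 reported above):
| | High | Low | Medium |
|---|---|---|---|
| High | 20.458 | 0.951 | 3.564 |
| Low | 0.448 | 20.118 | 4.036 |
| Medium | 4.144 | 4.082 | 42.199 |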
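Confusion matrix for the multinomial model (entries are percentages of all observations; the diagonal sums to the accuracy of 0.837 reported above):
| | High | Low | Medium |
|---|---|---|---|
| High | 20.566 | 0.804 | 3.139 |
| Low | 0.464 | 20.520 | 4.005 |
| Medium | 4.020 | 3.827 | 42.655 |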
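Confusion matrix for the neural network model (entries are percentages of all observations; the diagonal sums to the accuracy of 0.769 reported above):
| | High | Low | Medium |
|---|---|---|---|
| High | 19.337 | 0.503 | 4.059 |
| Low | 0.340 | 16.561 | 4.701 |
| Medium | 5.373 | 8.087 | 41.039 |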
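Accuracy can be read directly off a confusion matrix as the sum of the diagonal divided by the sum of all entries. The sketch below applies this to the three matrices above, with values copied from the tables; the function and variable names are hypothetical. The results come out at roughly 0.828, 0.837, and 0.769, which agree with three of the accuracies in the accuracy table.

```python
def accuracy(matrix):
    """Share of observations on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Entries copied from the three confusion matrices above (class order: High, Low, Medium).
first_matrix = [
    [20.458, 0.951, 3.564],
    [0.448, 20.118, 4.036],
    [4.144, 4.082, 42.199],
]
second_matrix = [
    [20.566, 0.804, 3.139],
    [0.464, 20.520, 4.005],
    [4.020, 3.827, 42.655],
]
third_matrix = [
    [19.337, 0.503, 4.059],
    [0.340, 16.561, 4.701],
    [5.373, 8.087, 41.039],
]

accuracies = [accuracy(m) for m in (first_matrix, second_matrix, third_matrix)]
for value in accuracies:
    print(f"{value:.4f}")
```

The same function works whether the entries are counts or percentages, because the normalization by the total cancels the scale.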
On the test data, the confusion matrix below reflects high accuracy: the diagonal entry is the largest value in each row.
| | High | Medium | Low |
|---|---|---|---|
| Predicted: High | 704 | 13 | 84 |
| Predicted: Medium | 11 | 659 | 140 |
| Predicted: Low | 85 | 122 | 1416 |
While our regression model did not meet the expected outcomes, we believe this analysis demonstrates a proof of concept for a suicide rate prediction system. With more data, in terms of both samples and features, this model could likely be improved before being put into practice.
Below is the predicted suicide rate plot by age using the linear regression model.
Below is the predicted suicide rate plot by age using the KNN model.
The KNN model has a higher RMSE than the linear regression, and its plot shows an obvious spread of data points. Both plots show that people aged 75 or above tend to have a higher suicide rate than other age groups. This finding may point to social issues. For example, do we care enough for the elderly in our families? Do we spend enough time with them and listen to their needs? It is an alarm bell for governments and the public to pay more attention to older people. Governments could consider offering more social benefits to the elderly, and families should give their elderly relatives more time and care.
When choosing the metric to assess model performance, we considered accuracy, specificity, and sensitivity. We decided to use accuracy because we are most concerned with how closely the classified suicide rate level matches the true level. If we were building a classification model in another setting, such as credit card fraud detection (genuine vs. fraud), we might care more about sensitivity or specificity. There, the most severe error is a fraudulent card classified as genuine, which delays the bank’s response and can cause the cardholder a large loss. Suicide rate classification is different. Since our goal is to provide a suicide rate reference for each country, misclassifying any level leads to a misinterpretation of that country’s rate, so all errors matter equally. We therefore chose overall accuracy as the metric for examining our models.
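To make the distinction between these metrics concrete, the sketch below computes one-vs-rest sensitivity and specificity for each class from the test-set confusion matrix reported in the results. It assumes that rows are predictions and columns are actual classes, with labels taken as printed in that table; the function name `per_class_rates` is hypothetical.

```python
def per_class_rates(matrix, labels):
    """One-vs-rest sensitivity and specificity for each class.

    matrix[i][j] is the number of observations predicted as labels[i]
    whose actual class is labels[j] (an assumption about the table layout).
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    rates = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[i][k] for i in range(n)) - tp   # actual k, predicted other
        fp = sum(matrix[k]) - tp                        # predicted k, actual other
        tn = total - tp - fn - fp
        rates[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return rates

# Counts copied from the test-set confusion matrix in the results section.
labels = ["High", "Medium", "Low"]
test_matrix = [
    [704, 13, 84],
    [11, 659, 140],
    [85, 122, 1416],
]

class_rates = per_class_rates(test_matrix, labels)
for label, values in class_rates.items():
    print(label, values)
```

Accuracy summarizes all classes in one number, whereas these per-class rates show which level is most often missed, which is the information a fraud-style application would prioritize.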
We found that the random forest gave the best classification results, and the visualization below directly compares actual and predicted values. Correctly classified observations are shown in blue and misclassified observations in red. The majority of the observations are blue, indicating a relatively high accuracy rate (87.04%). We conclude that the random forest model performed well on the current data. As more data become available, this model could be improved before being put into practice.
- country: Name of the country
- year: Year of the suicide rate
- sex: Gender of the suicide
- age: Age of the suicide
- Suicides_no: Number of suicides
- population: Country population
- suicide/100k pop: Number of suicides per 100,000 population (the suicide rate)
- Country-year: Country and year
- HDI for year: The Human Development Index, a statistical composite index of life expectancy, education, and per capita income indicators
- gdp_for_year ($): Country GDP for the year
- gdp_per_capita ($): Country’s gross domestic product divided by its total population
- Generation: The generation range of the suicide

For additional information, see documentation on Kaggle.[^6]
| Country | Sex | Age | Mean Suicide Rate |
|---|---|---|---|
| Albania | female | 15-24 years | 4.015 |
| Albania | female | 25-34 years | 2.644 |
| Albania | female | 35-54 years | 2.164 |
| Albania | female | 5-14 years | 0.303 |
| Albania | female | 55-74 years | 1.567 |
| Albania | female | 75+ years | 3.806 |
| Country | GDP per Year | GDP per Capita | Mean Suicide Rate |
|---|---|---|---|
| Albania | 5211.661 | 1859 | 3.503 |
| Antigua and Barbuda | 803.545 | 10448 | 0.553 |
| Argentina | 274256.498 | 7914 | 10.469 |
| Armenia | 5386.592 | 1873 | 3.276 |
| Aruba | 2196.223 | 24221 | 9.503 |
| Australia | 632750.138 | 32776 | 12.993 |
[^2]: United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
[^3]: World Bank. (2018). World development indicators: GDP (current US$) by country: 1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[^4]: [Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
[^5]: World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/